Switching Text-Based Image Encoders for Captioning Images With Text
Abstract
Visual understanding, such as image caption generation, has received extensive attention. Describing images with textual information is one way to help people achieve barrier-free visibility. This study focuses on the text-based captioning (TextCaps) task. The TextCaps task is more complex than traditional image captioning because it depends on optical character recognition (OCR) and on the text that appears in the image. It also requires consideration of the relationship between the recognized objects and the linguistic part of the OCR output. In this study, we propose maximizing the use of multiple modalities to improve performance. We enrich the OCR features using pre-trained Contrastive Language-Image Pre-training (CLIP) models. We then introduce two additional attention models into a transformer architecture to strengthen the representation of each modality. The experimental results demonstrate that our proposed method, which introduces multimodal attention for four image-related modalities, outperforms existing methods on the TextCaps dataset.
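To make the idea of strengthening one modality with another more concrete, the following is a minimal sketch of scaled dot-product cross-attention, in which OCR token features attend over features from a second modality and the result is added back as a residual. It is not the authors' architecture: the function names `softmax` and `cross_attention`, the toy two-dimensional vectors, and the residual combination are all assumptions chosen for illustration, and a real system would use learned projections and pre-trained CLIP embeddings instead of hand-written numbers.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query row attends over all key/value rows."""
    d = len(keys[0])
    out = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value rows.
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# Toy features (hypothetical values, not taken from the paper).
ocr_feats = [[1.0, 0.0], [0.0, 1.0]]               # two OCR tokens
clip_feats = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # three features from another modality

# OCR tokens query the other modality; the attended result is added as a residual.
attended = cross_attention(ocr_feats, clip_feats, clip_feats)
enriched = [[o + a for o, a in zip(o_row, a_row)]
            for o_row, a_row in zip(ocr_feats, attended)]
```

In this sketch each OCR token ends up with a representation that mixes in the features it is most similar to, which is the general intuition behind attention-based fusion of modalities.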
Similar resources
Text-Guided Attention Model for Image Captioning
Visual attention plays an important role in understanding images and demonstrates its effectiveness in generating natural language descriptions of images. On the other hand, recent studies show that language associated with an image can steer visual attention in the scene during our cognitive process. Inspired by this, we introduce a text-guided attention model for image captioning, which learns t...
Corpus based coreference resolution for Farsi text
"Coreference resolution", or "finding all expressions that refer to the same entity" in a text, is one of the important requirements in natural language processing. Two words are coreferent when both refer to a single entity in the text or the real world. So the main task of coreference resolution systems is to identify terms that refer to a unique entity. A coreference resolution tool could be...
Improving Automatic Image Captioning Using Text Summarization Techniques
This paper presents two different approaches to automatic captioning of geo-tagged images by summarizing multiple web-documents that contain information related to an image’s location: a graph-based and a statistical-based approach. The graph-based method uses text cohesion techniques to identify information relevant to a location. The statistical-based technique relies on different word or nou...
Text Alignment for Real-Time Crowd Captioning
The primary way of providing real-time captioning for deaf and hard of hearing people is to employ expensive professional stenographers who can type as fast as natural speaking rates. Recent work has shown that a feasible alternative is to combine the partial captions of ordinary typists, each of whom types part of what they hear. In this paper, we describe an improved method for combining part...
Automated closed-captioning using text alignment
The production of closed captions is an important but expensive process in video broadcasting. We propose a method to generate highly accurate off-line captions efficiently. Our system uses text alignment to synchronize program transcripts obtained for a video program with text produced by an automatic speech recognition (ASR) system. We will also describe the accuracy in both closed-caption te...
Journal
Journal title: IEEE Access
Year: 2023
ISSN: 2169-3536
DOI: https://doi.org/10.1109/access.2023.3282444